For today’s discussion, we will review the distinction between parametric and nonparametric regression models, as well as some key properties of simple linear regression. If you wish to review materials from POL 211 & 212, you can access them through the following links: POL 211 Discussion, POL 212 Discussion.
When it comes to regression analysis, choosing the right approach is crucial for accurate predictions and meaningful insights. Two common approaches are parametric methods, such as linear regression, and semi-/non-parametric methods, such as smoothing spline regression or kernel regression. Each has its own advantages and disadvantages, and the choice between them largely depends on the nature of the data and the underlying relationships.
Parametric Regression

Linear regression is a well-known parametric method that assumes a linear functional form for the relationship between the predictors (\(X\)) and the target variable (\(Y\)). This approach has several benefits, such as ease of estimation with a small number of coefficients. In linear regression, these coefficients have straightforward interpretations, and statistical significance tests are readily applicable. However, parametric methods come with a significant limitation: they rely on the assumption that the specified functional form is a close approximation to the true relationship. If this assumption is far from reality, linear regression can perform poorly and yield unreliable results.
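As a concrete illustration of the parametric approach, the sketch below (an assumption for this post, not part of the original materials) simulates data from a known linear model and recovers the coefficients by ordinary least squares with NumPy. Because the assumed functional form matches the truth here, only three coefficients need to be estimated and the estimates land close to the true values.

```python
import numpy as np

# Simulate data from a known linear model: Y = 1 + 2*x1 - 0.5*x2 + noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column; fit by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates close to the true (1.0, 2.0, -0.5)
```

With the functional form correctly specified, a couple hundred observations suffice to pin down the coefficients; this is the efficiency advantage of the parametric approach.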
Nonparametric Regression

On the other hand, non-parametric methods like K-Nearest Neighbors (KNN) regression do not make explicit assumptions about the functional form of the relationship between \(X\) and \(Y\). Instead, they provide a more flexible approach to regression. KNN regression identifies the \(K\) training observations closest to a prediction point and estimates the target variable by averaging their responses. While this approach is more versatile and can handle complex relationships, it can suffer from high variance when \(K\) is small, leading to overfitting. Conversely, when \(K\) is large, KNN regression can underfit the data.
Assume that we have an outcome variable \(Y\) and two explanatory variables, \(x_1\) and \(x_2\). In general, the regression model that describes the relationship can be written as:
\[Y = f_1(x_1) + f_2(x_2) + \epsilon\]
Some parametric regression models assume specific forms for \(f_1\) and \(f_2\), for example:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \quad \text{(linear)}\]

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \epsilon \quad \text{(quadratic in } x_1\text{)}\]

In each case, estimation reduces to fitting a small number of coefficients.
If we do not know the functions \(f_1\) and \(f_2\), we need to use a nonparametric regression model.
K-Nearest Neighbors (KNN) regression is one of the simplest and best-known nonparametric methods.
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(\mathcal{N}_0\). In other words,

\[\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i\]
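The averaging rule above is simple enough to sketch directly. The following from-scratch implementation (the function name `knn_predict` is ours, not from any library) finds the \(K\) training points closest to \(x_0\) and returns the mean of their responses:

```python
import numpy as np

def knn_predict(x0, x_train, y_train, k):
    """Estimate f(x0) as the average response of the k nearest training points."""
    dist = np.abs(x_train - x0)        # distances from x0 (one-dimensional X)
    nearest = np.argsort(dist)[:k]     # indices of the k closest observations
    return y_train[nearest].mean()     # average their responses

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.1, 1.9, 3.2, 3.9])
print(knn_predict(2.1, x_train, y_train, k=3))  # mean of y at x = 2, 3, 1
```

Note that no coefficients are estimated at all; the prediction is built entirely from local averages of the training data, which is what makes the method nonparametric.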
The key question is when to choose a parametric approach like linear regression over a non-parametric one such as KNN regression. The answer is straightforward: a parametric approach performs better when the chosen functional form is a close match to the true relationship, particularly in the presence of a linear relationship. If the specified functional form is far from the truth, and prediction accuracy is our goal, then the parametric method will perform poorly. For instance, if we assume a linear relationship between X and Y but the true relationship is far from linear, then the resulting model will provide a poor fit to the data, and any conclusions drawn from it will be suspect.
In contrast, non-parametric methods do not explicitly assume a parametric form for \(f(X)\), and thereby provide an alternative and more flexible approach for performing regression.
To illustrate this point, let’s consider a few scenarios:
TBD
TBD
TBD